ABSTRACT

An  experiment  is  a  test  under  controlled  conditions  to  investigate  the  validity  of  a  hypothesis. 
Experimentation  is  the  basis  for  creating  new  knowledge  -  determining  whether  some  factor 
causes  an  effect.  Well-designed  experiments  provide  a  systematic  approach  to  observe 
relations  among  variables  while  ruling  out  alternative  explanations.  This  report  has  examined 
a  number  of  experimental  designs  including  the  simple  experiment,  matched-pairs,  repeated- 
measures  and  the  single  group  design.  The  relevant  statistical  techniques  are  also  discussed  to 
help  identify  key  quantitative  methods  for  data  processing  and  analysis.  While  not  intending 
to  replace  any  textbook,  this  guide  summarises  the  resources  and  some  worked  examples  that 
have  helped  the  authors  gain  a  basic  understanding  of  design,  measurement  and  statistical 
analysis  to  support  military  experiments. 
A Short Guide to Experimental Design and Analysis for Engineers


Executive  Summary 

An  experiment  is  a  test  under  controlled  conditions  to  investigate  the  validity  of  a 
hypothesis.  Experimentation  is  the  basis  for  creating  new  knowledge  -  determining 
whether  some  factor  causes  an  effect.  The  basic  premise  of  this  evidence-based 
approach  is  to  discard  subjectivity  of  authorities,  and  to  seek  the  facts  from  scientific 
observation  of  certain  phenomena.  Well-designed  experiments  provide  a  systematic 
approach  to  observe  relations  among  variables  while  ruling  out  alternative 
explanations.  In  order  for  an  experiment  to  establish  cause-and-effect,  the  experiment 
must  have  internal-validity  -  that  is,  setting  up  the  conditions  that  allow  a  treatment's 
effects  to  be  isolated.  While  uncontrolled  experiments  such  as  those  conducted  in  the 
field  do  not  mitigate  all  confounding  factors,  they  nonetheless  provide  external- 
validity  to  support  the  results  from  controlled  experiments. 

This  report  has  examined  a  number  of  experimental  designs  including  the  simple 
experiment,  matched-pairs,  repeated-measures  and  the  single  group  design.  The 
relevant  statistical  techniques  are  also  discussed  to  help  identify  key  quantitative 
methods  for  data  processing  and  analysis.  While  not  intending  to  replace  any  textbook, 
this  guide  summarises  the  resources  and  some  worked  examples  that  have  helped  the 
authors gain a basic understanding of design, measurement and statistical analysis to support military
experiments. 
1. Introduction


Science  is  derived  from  the  Latin  term  scientia,  which  means  "knowledge".  Accordingly, 
scientific  method  literally  means  the  "method  that  searches  after  knowledge."  Scientific 
method originated around the 16th century when people found that when data was
assembled and examined without bias, some previously undiscovered meaning might be
revealed. Traditionally, scientific method seeks to understand the unknown by (Leedy
and Ormrod, 2005):

1.  identifying  a  problem  to  solve 

2.  establishing  a  hypothesis  that  if  confirmed  resolves  the  problem 

3. gathering data relevant to the hypothesis

4.  analysing  and  interpreting  the  data  to  determine  whether  the  hypothesis  is 
supported  or  not. 

Experimentation  is  a  central  aspect  of  scientific  method  and  is  the  basis  for  creating  new 
knowledge  -  determining  whether  some  factor  causes  an  effect. 

The  Logic  of  Warfighting  Experiments  (Kass,  2006)  and  GUIDEx  (Kass  et  al.,  2006)  provide 
quintessential  reading  for  planning  military  experiments.  However,  both  books  have 
deliberately  omitted  some  aspects  of  experimental  design  and  statistics.  Being  from  a 
different  background  (engineering),  the  authors  have  decided  to  collaborate  on  this  short 
guide  to  help  identify  key  topics  on  experimental  design,  measurement  and  analysis. 


2.  Experimental  Design 

An  experiment  is  a  scientific  procedure  used  to  test  a  hypothesis,  answer  a  question,  or 
prove  a  fact.  Two  common  types  of  experiments  are  simple  experiments  and  controlled 
experiments. A simple experiment is a specific type of study to establish cause-and-effect
and is often used to determine the effect of a treatment. Experiments can be extremely
complex  and  involve  a  multitude  of  variables.  In  a  complex  setting,  a  controlled 
experiment  is  considered  a  better  experiment  because  it  is  harder  for  other  factors  to 
influence  the  results,  which  could  lead  to  an  incorrect  conclusion. 

In  order  for  an  experiment  to  establish  cause-and-effect,  the  experiment  must  have 
internal-validity  -  that  is,  setting  up  the  conditions  that  allow  a  treatment's  effects  to  be 
isolated  (Mitchell  and  Jolley,  2010).  A  simple  experiment  provides  the  easiest  way  to 
establish  a  cause-and-effect  relationship,  but  this  approach  is  not  always  possible  or  viable 
due  to  real  world  constraints.  This  section  details  a  range  of  experimental  designs  and  the 
requirements  for  each  to  establish  causality. 
2.1 Simple Experiment

The  goal  of  science  is  to  establish  and  advance  knowledge.  Experiments  provide  a 
systematic  procedure  for  scientists  to  observe  relations  among  variables  while  ruling  out 
alternative explanations (Nock et al., 2008). In human-in-the-loop experiments, ideal or
simple  experiments  are  those  where  participants  are  randomly  assigned  to  either  a 
treatment  or  control  (comparison)  group.  The  idea  that  some  variable  has  an  effect  on 
another  is  tested  in  a  simple  experiment  by  considering  whether  measurements  from  a 
group  given  the  treatment  are  statistically  different  to  those  from  the  control  group. 
Examples  of  simple  experiments  where  participants  are  randomly  assigned  into  groups 
include: 

•  clinical  trials  testing  the  effectiveness  of  a  drug  where  one  group  is  given  the 
treatment  and  the  control  group  given  a  placebo, 

•  tests  to  establish  whether  some  widget  improves  performance  of  X  where  one 
group  is  given  the  new  device  and  the  control  group  employing  the  standard 
technique 

or 

•  experiments  to  examine  whether  a  process  change  is  more  efficient  than  the 
existing  process  where  groups  are  assigned  to  one  method  or  another. 

While  random  assignment  mitigates  selection  bias  as  an  explanation  for  differences 
between  treatments  (Slavin,  2007),  it  doesn't  produce  identical  groups.  Results  from  each 
group  will  have  statistical  variation  due  to  chance  (i.e.  different  group  means).  Testing  for 
statistical  significance  determines  whether  the  difference  between  groups  is  due  to 
something  other  than  random  error.  As  a  rule  of  thumb,  simple  experiments  should 
employ  at  least  30  participants  in  each  condition  to  minimise  error  due  to  differences 
between  participants  (Mitchell  and  Jolley,  2010). 
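
The simple experiment described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function names, the seed and the simulated scores are all hypothetical, and the p-value uses a large-sample normal approximation to the t distribution, which is only reasonable with roughly 30 or more participants per group.

```python
import random
import statistics
from math import sqrt


def randomly_assign(participants, rng):
    """Shuffle participants and split them into treatment and control halves."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def approx_two_sided_p(t):
    """Two-sided p-value using a normal approximation (assumes large groups)."""
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


rng = random.Random(1)  # fixed seed so the illustration is repeatable
treatment_ids, control_ids = randomly_assign(range(60), rng)

# Hypothetical scores: the treatment group is simulated with a higher mean.
treatment_scores = [rng.gauss(55.0, 10.0) for _ in treatment_ids]
control_scores = [rng.gauss(50.0, 10.0) for _ in control_ids]

t = welch_t(treatment_scores, control_scores)
p = approx_two_sided_p(t)
print(f"t = {t:.2f}, approximate p = {p:.4f}")
```

With small groups, the t distribution (from statistical tables or a statistics package) should replace the normal approximation.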

Table  1  summarises  the  possible  outcomes  of  a  statistical  significance  decision  from  an 
experiment.  A  correct  decision  is  made  when  the  decision  from  statistical  significance 
testing  agrees  with  the  actual  state  of  affairs.  A  type  1  error  occurs  when  the  analysis 
reports  a  treatment  as  having  an  effect  when  in  fact  it  does  not,  that  is  differences  due  to 
chance  are  mistaken  for  real  differences.  In  contrast,  a  type  2  error  occurs  when  a 
treatment  is  reported  as  not  having  an  effect  when  in  reality  there  is.  Under  these 
circumstances,  a  treatment  did  have  an  effect  but  the  study  failed  to  detect  it  (Mitchell  and 
Jolley,  2010). 
Table 1: Possible outcomes of statistical significance decisions (Mitchell and Jolley, 2010)

                                                True State of Affairs
Statistical Significance Decision               Treatment has an Effect   Treatment doesn't have an Effect
Significant: Reject the Null Hypothesis         Correct decision          Type 1 error
Not significant: Don't reject Null Hypothesis   Type 2 error              Correct decision

Descriptive  statistics  are  techniques  for  describing  a  sample  and  include  measures  of 
central  tendency  (mean,  median  and  mode),  frequency  distributions  and  graphs. 
Comparisons  using  this  approach  on  data  from  two  groups  reveal  differences  between 
those  specific  groups  but  the  results  cannot  be  generalised  to  a  larger  population. 
Importantly, simply comparing group means to determine the treatment effect neglects
random variations due to chance. Accounting for these variations, and applying conclusions
from the experiment more generally, requires techniques from inferential statistics such as
the t-test and ANOVA. These and other techniques are described in Section 4 under
Hypothesis Testing.

Results  from  each  group  will  be  spread  about  some  mean  value.  These  variations  are 
termed  random  errors  and  may  be  due  to  (Mitchell  and  Jolley,  2010): 

•  random  measurement  error 

•  random  differences  between  testing  situations 

•  random  differences  between  participants 

•  data  entry  errors  when  coding  data. 

These  all  contribute  to  type  2  errors.  Ways  to  mitigate  these  errors  to  design  powerful 
experiments  include  taking  greater  care  in  data  entry,  standardising  procedures,  using 
reliable  measures,  increasing  numbers  of  trials  or  using  a  homogeneous  group  of 
participants.  Another  way  to  reduce  type  2  errors  is  to  create  a  larger  treatment  effect  such 
as  by  setting  higher  doses  or  increasing  difficulty  levels. 

Also known as false positives, type 1 errors occur when a chance difference is mistaken
for a real difference. There is only one way to deal with type 1 errors and that is to decide
on the acceptable risk of making such an error through the significance level, α. The
concept of statistical significance is most easily understood by an example. Suppose a coin
used in coin tossing is being assessed for bias because previous flips revealed a tendency
for heads to appear. An experiment to test for bias is conducted by flipping the coin ten
times and comparing the results against the chances of heads appearing in such
circumstances for a fair coin. The chances of obtaining 8, 9 or 10 heads in 10 tosses from a
fair coin are shown in Table 2. As shown, scoring ten heads out of ten tosses is not
impossible, although the chance is very small.
Table 2: Probability of events occurring from 10 coin tosses (Mitchell and Jolley, 2010).

Event             Expression                        %       Decimal   Calculation
8 or more heads   P(X ≥ 8) = P(X = 8) + P(X ≥ 9)    5.47%   0.0547    $\binom{10}{8}\left(\frac{1}{2}\right)^{8}\left(\frac{1}{2}\right)^{10-8} + P(X \ge 9)$
9 or more heads   P(X ≥ 9) = P(X = 9) + P(X = 10)   1.07%   0.0107    $\binom{10}{9}\left(\frac{1}{2}\right)^{9}\left(\frac{1}{2}\right)^{10-9} + P(X = 10)$
10 heads          P(X = 10)                         0.1%    0.001     $\binom{10}{10}\left(\frac{1}{2}\right)^{10}\left(\frac{1}{2}\right)^{10-10}$


The level of significance, α, determines the level of risk of making type 1 errors when
making a conclusion. For the coin toss example, setting α to 0% (or 0.0) implies accepting
no risk in making type 1 errors, meaning that not even ten heads from ten tosses will be
convincing enough to declare a biased coin because the p-value (chance) of 0.001 is larger
than α. Setting α = 0.05 or 5% implies considering the coin biased when scoring nine or
more heads from ten tosses, but not with eight or more heads. Reducing the risk of making
a type 1 error increases the probability of type 2 errors because genuine treatment effects
(such as actually using a biased coin) are overlooked (Mitchell and Jolley, 2010).
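
The coin-toss probabilities can be reproduced directly from the binomial distribution. The sketch below is illustrative only; the function name `prob_at_least` and the loop are assumptions made for this example and do not come from the cited text.

```python
from math import comb


def prob_at_least(k, n=10, p=0.5):
    """P(X >= k) for the number of heads X in n tosses of a coin with P(head) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


alpha = 0.05  # accepted risk of a type 1 error
for k in (8, 9, 10):
    p_value = prob_at_least(k)
    verdict = "declare biased" if p_value < alpha else "do not declare biased"
    print(f"{k} or more heads: p = {p_value:.4f} -> {verdict}")
```

Lowering `alpha` makes it harder to declare the coin biased, which is the trade-off between type 1 and type 2 errors described above.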

Declaring a statistically significant result implies beyond reasonable doubt that the
differences observed between groups are due to the treatment and not chance. Here,
reasonable doubt is taken to be some value of α that is typically set to 0.05 or 5%. For this
reason, we often see a declaration that "the results were statistically significant (p < 0.05)"
(Mitchell and Jolley, 2010).


2.2  Matched-Pairs  Design 

While  a  simple  experiment  provides  one  of  the  easiest  ways  of  determining  whether  a 
factor  causes  an  effect,  practical  limitations  often  prevent  participants  from  being 
randomly  assigned  to  each  group.  In  some  cases,  randomly  assigning  at  least 
30  participants  from  a  population  to  each  group  poses  a  significant  challenge  that  prevents 
use  of  the  simple  experiment.  A  matched-pairs  design  is  an  alternative  approach 
combining  the  best  attributes  of  matching  and  random  assignment  that  requires  fewer 
participants  than  with  simple  experiments  (Mitchell  and  Jolley,  2010). 

Whereas  a  simple  experiment  attempts  to  minimise  random  errors  between  groups  by 
increasing  numbers  of  participants,  a  matched-pairs  design  seeks  to  do  so  by  creating 
control  and  treatment  groups  with  similar  attributes.  Similar  groups  are  created  by 
measuring  participants  with  respect  to  a  variable  correlating  with  the  dependent  measure. 
For  example,  participants  in  a  memory  experiment  would  be  given  a  memory  test  that 
then allows them to be ranked according to their scores. Pairing the two highest scores,
repeating this step for the remaining ranked participants, and then randomly assigning one
member of each pair to either the control or treatment group produces the similar groups
(Mitchell and Jolley, 2010).
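
The rank, pair and assign procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the participant labels and the pre-test scores are hypothetical, and a leftover participant from an odd-sized pool is simply left unassigned.

```python
import random


def matched_pairs_assignment(pretest_scores, rng):
    """Rank by pre-test score, pair neighbours, then split each pair at random.

    pretest_scores maps a participant label to a pre-test score.
    Returns (treatment, control) lists of participant labels.
    """
    ranked = sorted(pretest_scores, key=pretest_scores.get, reverse=True)
    treatment, control = [], []
    # Step through the ranking two at a time; an odd leftover is not assigned.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random assignment within the matched pair
        treatment.append(pair[0])
        control.append(pair[1])
    return treatment, control


# Hypothetical memory-test scores for six participants.
scores = {"P1": 91, "P2": 84, "P3": 77, "P4": 75, "P5": 62, "P6": 58}
treatment, control = matched_pairs_assignment(scores, random.Random(3))
print("treatment:", treatment)
print("control:  ", control)
```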


2.3  Repeated-Measures  Design  (Within-Subjects  Design) 

The  ultimate  way  to  eliminate  random  error  due  to  individual  differences  is  to  employ  the 
same  group  for  both  treatment  and  control.  In  repeated-measures  designs,  each  participant 
receives all types of treatments administered in the experiment and is measured after each
type of treatment. A restriction with a between-subjects design such as a matched-pairs
design or simple experiment is the limit of one observation per participant.
Within-subjects designs, however, allow at least two observations per participant. For
example,  in  a  study  of  women's  ratings  of  men  based  on  masculinity,  Frederick  and 
Haselton  had  participants  performing  octuple  duty  -  providing  ratings  of  attractiveness 
for  eight  drawings  that  varied  in  muscularity  (Mitchell  and  Jolley,  2010).  However,  a 
potential  risk  with  a  repeated-measures  design  relates  to  order  (trial)  effects  due  to  taking 
multiple  observations  from  the  same  participant.  That  is,  in  a  within-subjects  design  the 
treatment  may  not  be  the  only  factor  being  manipulated  due  to  order  effects  being  a 
confounding  factor.  Mitchell  and  Jolley  (2010)  identify  four  sources  of  order  effects: 

Practice:  learning  from  earlier  treatments  improves  performance  in  subsequent  trials 

Fatigue:  performance  decreases  due  to  tiredness  in  subsequent  trials 

Treatment carryover: effects of earlier treatments carry to responses in later trials

Sensitisation: participants realise what the independent and dependent variables are in
later parts of the experiment and may act to support the hypothesis rather than
reacting to the treatment

Randomising  the  order  of  trials  is  one  way  of  balancing  out  order  effects,  while  giving 
participants  extensive  practice  before  the  experiment  minimises  practice  effects.  Treatment 
carryover  effects  can  be  reduced  by  allowing  sufficient  time  for  the  effects  from  a  previous 
treatment  to  wear  off.  Reducing  the  demands  of  a  treatment  or  shortening  the  duration 
helps  minimise  fatigue  effects  while  minimising  sensitisation  effects  might  involve 
preventing  participants  from  noticing  the  changes  between  treatments. 
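
Randomising the order of trials is straightforward to automate. The sketch below gives each participant an independently shuffled treatment order; the function name, participant labels and treatment labels are hypothetical.

```python
import random


def randomised_orders(participants, treatments, rng):
    """Give each participant an independently shuffled treatment order."""
    orders = {}
    for person in participants:
        order = list(treatments)
        rng.shuffle(order)
        orders[person] = order
    return orders


rng = random.Random(7)  # fixed seed so the schedule is reproducible
orders = randomised_orders(["P1", "P2", "P3", "P4"], ["A", "B", "C"], rng)
for person, order in orders.items():
    print(person, order)
```

With only a few participants, chance may leave some orders over-represented, so a fully counterbalanced schedule may be preferable in that case.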


2.4  Single  Group  Designs 

Whilst some Defence human-in-the-loop experiments employ a single operator such as for
evaluating  cockpit  design  or  a  command  support  system,  command  and  control  (C2) 
experiments  typically  involve  a  unit  (group)  where  it  is  difficult  to  allocate  a  unique  team 
to  each  trial  due  to  a  small  pool  of  participants.  For  this  reason,  the  same  unit  receives  all 
treatment  conditions  in  a  single  group  design  (Kass  et  al.,  2006).  As  in  repeated-measures 
designs,  order  effects  are  a  risk  for  establishing  cause-and-effect  in  single  group  designs 
because later trials may be positively biased due to learning or memory effects or
negatively  biased  from  fatigue.  Kass  et  al.  suggest  using  a  counterbalanced  design  to 
reduce  order  effects  (see  Table  3). 
Table 3: A counterbalanced experimental design.

Mon       Tue      Wed      Thu
Current   Future   Future   Current

If  units  are  presented  with  identical  scenarios  for  all  conditions,  the  appropriate  statistical 
analysis  technique  is  the  paired  t-test  or  repeated-measures  (M)ANOVA  (Gueorguieva 
and  Krystal,  2004).  Otherwise,  if  scenarios  between  conditions  are  similar  but  not  exactly 
the  same,  then  the  t-test  or  (M)ANOVA  becomes  the  appropriate  method  for  analysis. 
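
For the identical-scenario case, the paired t statistic is computed from the per-scenario differences between conditions. The sketch below is an assumption-laden illustration: the function name and the scores are hypothetical, and it stops at the statistic and degrees of freedom, which would then be compared against a t table or passed to a statistics package for a p-value.

```python
import statistics
from math import sqrt


def paired_t(condition_a, condition_b):
    """Paired t statistic and degrees of freedom for matched observations."""
    if len(condition_a) != len(condition_b):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical scores for the same unit on three identical scenarios.
current = [10.0, 10.0, 10.0]
future = [11.0, 12.0, 13.0]
t_stat, dof = paired_t(current, future)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```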


2.5  Exploratory  /  Uncontrolled  Experiments 

Defence  experiments  such  as  the  study  of  autonomous  workgroups  in  dynamic  targeting 
in  Exercise  Pitch  Black  2008  (Lo  et  al.,  2013)  or  the  CAGE  series  (Lo  et  al.,  2014)  of  joint 
fires  activities  have  been  uncontrolled  in  nature.  An  uncontrolled  experiment  does  not 
mitigate  confounding  factors,  thus  preventing  statements  about  cause-and-effect  from 
being  determined.  In  CAGE3A  (run  in  2013),  issues  included: 

1.  order  effects  (memory,  fatigue  and  learning  effects) 

2.  controlled  variables  were  not  held  constant  (IT  systems  regularly  failing) 

3.  independent  variables  were  not  held  constant  under  the  same  condition  (removing 
players,  removing  monitors,  introducing  unplanned  systems,  changing  seating 
from  days  2  or  3  in  the  To-Be  week) 

4.  multiple  hypotheses  being  tested  in  a  single  experiment. 

Ideally,  an  experiment  should  only  test  a  single  hypothesis.  According  to  Neuman  (2006), 
an  experiment  is  rarely  appropriate  for  research  questions  that  require  studying  the  impact 
of  dozens  of  diverse  variables  simultaneously.  Rarely  do  experiments  enable  assessing 
conditions  across  a  wide  range  of  complex  settings  or  numerous  social  groups  all  at  the 
same  time.  During  CAGE3A,  causes  from  one  hypothesis  appeared  to  impact  another, 
meaning  that  an  alternative  cause  not  identified  in  the  hypothesis  was  responsible  for  the 
effect. For this reason, uncontrolled experiments only produce anecdotal evidence
supporting or refuting experimental objectives (Lo et al., 2014). However, this is not to say
that uncontrolled experiments aren't important. For example, field experiments as
exploratory  experiments  examining  interventions  in  the  real  world  can  provide  valuable 
insights  into  the  sociotechnical  system,  identify  problems  to  investigate  or  gain  new 
insights  to  develop  hypotheses.  A  campaign  of  controlled  and  uncontrolled  experiments 
demonstrates  internal  and  external  validity  for  cause-and-effect  statements. 


3.  Measuring  Data  and  Descriptive  Statistics 

Studies  of  sociotechnical  systems  invariably  involve  gauging  people's  behaviours, 
characteristics,  attitudes  and  opinions.  Techniques  to  facilitate  the  quantification  and 
evaluation of possibly complex behaviours and attitudes include the checklist and the
rating scale. A checklist is a list of behaviours or characteristics under investigation that
allows the researcher or participant to tick off each item as observed, present or true (or not). A
rating scale is useful when a behaviour, attitude or opinion needs to be evaluated on an
interval scale of measurement such as in Table 4. Developed by Rensis Likert, rating scales
are sometimes referred to as Likert scales (Leedy and Ormrod, 2005). A Likert scale has the
advantage  of  producing  responses  as  quantitative  data  reflecting  degrees  of  opinion. 

Table 4: Likert scales can generate quantitative metrics from a variety of questions [1]

             (Low)                           Likert Scale                           (High)
             1                   2                      3                      4              5
Agreement    Strongly Disagree   Disagree               Undecided              Agree          Strongly Agree
Frequency    Never               Rarely                 Occasionally           Frequently     Very Frequently
Importance   Unimportant         Of Little Importance   Moderately Important   Important      Very Important
Likelihood   Almost Never True   Usually Not True       Occasionally True      Usually True   Almost Always True

3.1  Descriptive  Statistics 

Once  collated,  performance  between  specific  groups  can  be  compared  using  descriptive 
statistics  for  describing  a  sample  including  measures  of  central  tendency  (mean,  median 
and  mode),  frequency  distributions  and  graphs.  Measures  of  variability  include  dispersion 
and  deviation  while  a  measure  of  relationship  is  correlation.  However,  correlation  doesn't 
necessarily  indicate  causation  (Leedy  and  Ormrod,  2005).  Results  from  descriptive 
statistics  cannot  be  generalised  to  a  larger  population.  Generalising  to  a  larger  population 
requires inferential statistics as discussed in Section 4.
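
The descriptive measures named above can be computed with Python's standard `statistics` module. The sketch below is illustrative: the helper names `describe` and `pearson_r` and the ratings are hypothetical, and the correlation is computed by hand so the example does not depend on a recent Python version.

```python
import statistics


def describe(sample):
    """Central tendency and variability for one sample."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
        "range": max(sample) - min(sample),
        "stdev": statistics.stdev(sample),
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


ratings = [4, 5, 3, 4, 2, 4, 5, 3]  # hypothetical Likert responses
print(describe(ratings))
print(pearson_r([1, 2, 3], [2, 4, 6]))
```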

3.2  Visualising  Results 

An alternative approach to comparing numeric results through descriptive statistics is
through visualisation. This section surveys some notable approaches including:

1.  Radar  /  Spider  Chart 

2.  Box  and  Whisker  Plots 

3.  100%  Stacked  Bar 

4.  comparing  estimates  of  density  functions  fitted  from  the  data. 

Choice  of  the  approach  is  dependent  on  the  type  of  input  data  (naturally  qualitative  versus 
those  derived  from  a  Likert  scale)  or  the  number  of  dependent  variables. 


[1] S. A. McLeod (2008), Likert Scale - Simply Psychology,
http://www.simplypsychology.org/likert-scale.html, accessed 4 April 2014
3.2.1 Radar / Spider Chart

A  radar  or  spider  chart  such  as  shown  in  Figure  1  allows  one  set  of  multi-dimensional  data 
to  be  compared  against  another  (Few,  2005).  In  a  C2  experiment,  average  performance 
across  a  number  of  variables  such  as  communication,  leadership,  team  workload,  shared 
situation  and  others  may  be  compared  against  mean  scores  after  the  application  of  a 
treatment.  Another  use  of  radar  charts  is  in  comparing  performance  over  the  days  for  a 
specific  characteristic  such  as  timeliness. 


[Figure: radar chart with axes including Sales and Customer Support, comparing Allocated Budget against Actual Spending]

Figure 1: An example radar chart comparing sets of multi-dimensional data [2]

3.2.2 Box and Whisker Plots

A  box  and  whisker  plot  aggregates  a  box  plot  showing  the  first,  second  and  third  quartile 
levels  with  whiskers  representing  another  pair  of  user-defined  large  and  small  values 
(Mergerdichian  et  al.,  2012).  Popular  values  presented  by  whiskers  include: 

•  mean  value  ±  one  standard  deviation, 

•  maximum  /  minimum  values  sampled 

• 2nd and 98th percentiles.

Box  plots  provide  a  quick  way  to  assess  statistical  differences  between  samples  for  a 
specific  measure.  Placed  side-by-side,  a  pair  of  box  and  whisker  plots  provide  a  graphical 
way  to  compare  statistics  of  a  measure  from  two  or  more  samples  (see  example  in 
Figure  2). 
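
The quartiles behind a box plot can be computed with the standard library. The sketch below uses the minimum and maximum as whiskers, which is one of the options listed above; the function name and the "inclusive" quantile method are choices made for this illustration, and other quantile conventions give slightly different quartiles.

```python
import statistics


def box_summary(sample):
    """Five-number summary: whiskers at min/max, box at the three quartiles."""
    q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return {"min": min(sample), "q1": q1, "median": q2, "q3": q3, "max": max(sample)}


print(box_summary([1, 2, 3, 4, 5]))
```

Computing this summary for each sample and plotting the results side by side gives the comparison described above.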


[2] Wikipedia Radar Chart http://en.wikipedia.org/wiki/Radar_chart, accessed July 2013
[Figure: box and whisker plots for experiments 1 to 5; horizontal axis: Experiment No.]

Figure 2: Examples of box and whisker plots [3]

3.2.3 100% Stacked Bar

A  100%  stacked  bar  enables  comparison  of  distributions  within  categories.  Each  row  of  the 
graph  represents  all  responses  such  as  from  Likert  scales  for  that  category.  By  matching 
responses  across  rows  according  to  common  ratings  (such  as  Strongly  Disagree,  Disagree, 
etc.) a visual comparison can be made across categories (see Figure 3).


[Figure: 100% stacked bar titled "The Restructure has Improved Workplace Efficiency"; legend: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree]

Figure 3: Example of a 100% stacked bar.

3.2.4 Comparing Distributions

If  the  form  of  the  distribution  is  known,  the  parameters  may  be  evaluated  algebraically  or 
in  some  cases,  solved  numerically  from  histograms  generated  from  the  set  of 
measurements. For example, solving the parameters for a Normal distribution simply
involves evaluating the mean and standard deviation from the sample, while fitting a
Rayleigh distribution to a histogram (see Figure 4) requires non-linear optimisation.
Standard techniques for the latter include the Gauss-Newton, Levenberg-Marquardt and
Powell's Dog Leg methods (Madsen et al., 2004).


[3] Wikipedia Box Plot http://en.wikipedia.org/wiki/Box_plot, accessed July 2013

Figure 4: Fitting a Rayleigh distribution over a histogram.


Once  fitted,  the  resulting  distributions  may  be  compared  side-by-side  as  shown  in 
Figure  5. 


Figure 5: Comparing two example Normal distributions [4].

The Kolmogorov-Smirnov test evaluates numerically the difference between two samples
(e.g., $F_1$ and $F_2$) (Sheskin, 2004) or between a sample and its reference distribution
(e.g., $F$) (Papoulis and Pillai, 2002). The distance is the maximum vertical distance
between two cumulative distribution functions, given by:

$$d = \max_x |F_1(x) - F_2(x)|$$

or

$$d = \max_x |F_1(x) - F(x)|$$

respectively.
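
For the two-sample case, the distance can be computed from the empirical cumulative distribution functions of the two samples. The sketch below is a minimal illustration with hypothetical function names. It evaluates both step functions at every observed value, which is sufficient because the difference between them only changes at those values. It does not compute the associated significance level.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_two_sample(sample1, sample2):
    """Maximum vertical distance between the two empirical CDFs."""
    f1, f2 = ecdf(sample1), ecdf(sample2)
    points = set(sample1) | set(sample2)
    return max(abs(f1(x) - f2(x)) for x in points)


print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))
```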


[4] http://www.researchgate.net/post/. . . . accessed July 2013
4. Hypothesis Testing and Inferential Statistics

An  experiment  is  a  test  under  controlled  conditions  to  investigate  the  validity  of  a 
hypothesis.  A  hypothesis  gives  a  reasonable  or  educated  guess  to  explain  a  phenomenon 
under  investigation  (Leedy  and  Ormrod,  2005,  Salkind,  2008).  According  to  the  Collins 
Dictionary  the  term  is  derived  from  the  Greek  word  hupotithenai,  meaning  to  propose, 
suppose  or  put  under.  Proving  a  hypothesis  is  unfeasible  because  doing  so  would  involve 
testing  the  entire  population  for  validity  (Salkind,  2008).  Instead,  hypothesis  or 
significance  testing  involves  experimenting  on  a  sample  and  determining  whether  the 
results provide sufficient evidence to support the hypothesis. A well-designed experiment
seeks  to  mitigate  sampling  bias  by  ensuring  samples  approximate  the  characteristics  of  the 
population. 

A  variable  in  an  experiment  is  any  factor  that  can  be  introduced,  changed,  measured  or 
controlled.  Independent  variables  are  factors  that  change  in  an  experiment  and  are 
associated with the cause(s). Dependent variables relate to the effect(s) and are measured in
an  experiment.  Controlled  variables  must  be  held  constant  throughout  all  treatments  in  an 
experiment.  Allowing  controlled  variables  to  change  causes  them  to  become  independent 
variables  (Slavin,  2007).  Confounding  variables  are  additional,  unaccounted  for  variables 
in  an  experiment.  These  confounding  factors  threaten  the  validity  of  conclusions  made 
from  an  experiment  because  the  treatment  isn't  the  sole  factor  accounting  for  observed 
effects  (Leedy  and  Ormrod,  2005). 

A  good  hypothesis  captures  a  problem  statement  or  research  question  in  a  form  that  is 
more  amenable  to  testing.  Hypothesis  testing  involves  studying  the  extent  to  which  the 
independent  variable  (the  cause  or  treatment)  influences  the  dependent  variable  (the 
effect)  (Leedy  and  Ormrod,  2005).  The  research  or  alternative  hypothesis  is  a  definitive 
statement  of  a  relationship  between  variables.  Here,  the  term  variable  refers  to  any  quality 
or  characteristic  being  investigated  that  has  two  or  more  possible  values.  The  null 
hypothesis  states  the  converse  of  the  research  hypothesis,  that  is  no  relationship  exists 
between  the  variables  (Salkind,  2008). 

Descriptive  statistics  as  described  in  Section  3  are  techniques  for  describing  a  sample  and 
include  measures  of  central  tendency  (mean,  median  and  mode),  frequency  distributions 
and  graphs.  While  comparisons  using  this  approach  on  data  from  two  groups  reveal 
differences  between  those  specific  groups,  the  results  cannot  be  generalised  to  a  larger 
population.  Simply  comparing  means  to  determine  treatment  effects  neglects  to  account 
for random variations due to chance. Accounting for these variations, and applying
conclusions more generally, requires techniques from inferential statistics such as the Z-test,
t-test, paired t-test, ANOVA and MANOVA described in this section. Non-parametric
approaches include the Mann-Whitney and Wilcoxon signed-rank tests.

Hypothesis testing determines whether the experimental evidence is statistically sufficient
to reject the null hypothesis in favour of its alternative. It does so by computing a p-value:
the probability of obtaining results at least as extreme as those observed if the null
hypothesis were true. Typically, the p-value is compared against a significance level of
α = 0.05, which requires at least moderate evidence for rejecting the null hypothesis in
favour of the alternative (see Table 5).
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Table  5:  Interpreting  p-values  when  conducting  hypothesis  testing. 


p -value 

Meaning 

p  >  0.1 

0.05  <  p  <  0.1 
0.01  <  p  <  0.05 

0.001  <  p  <  0.01 
p  <0.001 

No  evidence  against  null  hypothesis 

Weak  evidence  against  null  hypothesis  in  favour  of  the  alternative 

Moderate  evidence  against  null  hypothesis  in  favour  of  the  alternative 

Strong  evidence  against  null  hypothesis  in  favour  of  the  alternative 

Very  strong  evidence  against  null  hypothesis  in  favour  of  the  alternative 

The value of $\alpha$ establishes the level of reasonable doubt, that is, the probability of wrongly rejecting the null hypothesis when it is true (a Type 1 error). The corresponding confidence level is $1 - \alpha$. Setting $\alpha = 0$ implies never rejecting the null hypothesis.
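
The reading of Table 5 can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, and because Table 5 uses strict inequalities, the treatment of values falling exactly on a boundary is an assumption.

```python
def interpret_p_value(p):
    """Map a p-value to the evidence categories of Table 5.

    Boundary values (e.g. exactly 0.05) are assigned to the stronger
    category here; Table 5 itself leaves them unspecified.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    if p > 0.1:
        return "no evidence"
    if p > 0.05:
        return "weak evidence"
    if p > 0.01:
        return "moderate evidence"
    if p > 0.001:
        return "strong evidence"
    return "very strong evidence"


def reject_null(p, alpha=0.05):
    """Reject the null hypothesis when p falls below the significance level."""
    return p < alpha


print(interpret_p_value(0.03), reject_null(0.03))
```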


4.1  Central  Limit  Theorem 

Evidence to support or refute the null hypothesis is gathered by conducting multiple trials to assess the outcomes. We can represent the set of $n$ outcomes as random variables $X_1, \ldots, X_n$. That is, each trial is a random process that generates results according to some distribution, e.g. normal, bimodal, uniform or exponential. The term sample is overloaded in statistics and can mean either the outcome of a single trial or a set of $n$ outcomes. In hypothesis testing, the term typically refers to the latter, with $X_1, \ldots, X_n$ being summarised by their mean value, $\bar{X}$.

One of the fundamental theorems in probability is the Central Limit Theorem (CLT), which relates the distribution generating each trial outcome to the sampling distribution of $\bar{X}$. It states that if $\bar{X}$ is the mean of $n$ mutually independent random variables taken from any population with mean $\mu$ and standard deviation $\sigma$, then the probability distribution of:

\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \tag{1} \]

tends to the standard normal probability distribution $N(0, 1)$ as $n \to \infty$ (Smith, 1993). Similarly, the distribution of $\bar{X}$ tends to $N(\mu, \sigma/\sqrt{n})$ as $n \to \infty$. The effect can be illustrated using convolution because if $X$ and $Y$ are independent random variables with density functions $f_X(x)$ and $f_Y(y)$ defined for all $x$ and $y$, then the sum $Z = X + Y$ is a random variable with density (Grinstead and Snell, 1997, Papoulis and Pillai, 2002):

\[ f_Z(z) = f_X(z) \otimes f_Y(z) \tag{2} \]

where convolution is defined as (Liu and Liu, 1975),

\[ f(t) \otimes g(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau. \tag{3} \]

Since

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} \frac{X_i}{n} \tag{4} \]

then

\[ f_{\bar{X}}(x) = f_{X_1/n}(x) \otimes f_{X_2/n}(x) \otimes \cdots \otimes f_{X_n/n}(x), \qquad f_{X_i/n}(x) = n\, f_{X_i}(n x). \tag{5} \]


* S. Khan (2011) Central Limit Theorem, http://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution, accessed April 2014


For  example,  consider  a  stochastic  process  with  a  uniform  distribution.  The  sampling 
distribution  for  n  =  2  is  a  triangular  distribution  that  can  be  generated  by  scaling  the 
uniform  distribution  by  a  factor  of  0.5  followed  by  convolving  the  resulting  scaled 
distribution  with  itself.  For  any  original  distribution,  as  n  increases,  the  sampling 
distribution  increasingly  resembles  a  normal  distribution  and  is  the  basis  for  the  CLT. 
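
The uniform example can be checked numerically. The sketch below is an illustration only: it assumes a uniform distribution on [0, 1], a grid step of 0.01 and a hypothetical helper `convolve` that approximates the convolution integral by a Riemann sum.

```python
def convolve(f, g, step):
    """Approximate the convolution integral of two sampled densities."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj * step
    return out


step = 0.01
# Density of X/2 when X is uniform on [0, 1]: height 2 on [0, 0.5).
scaled_uniform = [2.0] * 50

# Sampling distribution of the mean for n = 2: the scaled density
# convolved with itself, which is triangular on [0, 1].
triangular = convolve(scaled_uniform, scaled_uniform, step)

area = sum(triangular) * step   # a density must integrate to 1
peak = max(triangular)          # apex of the triangle, near x = 0.5
print(round(area, 6), round(peak, 6))
```

Convolving the result again with further scaled copies (using a factor of 1/n rather than 0.5) gives the sampling distribution for larger n, which becomes progressively more bell-shaped.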


4.2  Z-Statistic 

In  some  situations,  the  population's  statistical  properties  are  known  so  the  experiment 
only  involves  testing  the  experimental  group  and  assessing  the  probability  (the  p-value) 
that  the  results  come  from  the  population  with  known  statistical  properties.  For  example, 
body  mass,  height  and  intelligence  quotient  (IQ)  scores  have  known  population  means 
and  variances,  enabling  hypothesis  testing  using  a  single  sample  statistic.  A  significance 
test  using  the  Z-statistic  is  useful  for  large  values  of  n  because  the  following  statistic  takes 
on  a  normal  distribution: 


\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} \tag{6} \]

Here, $\bar{X}$ gives the sample mean, $\mu_{\bar{X}}$ gives the mean of the sampling distribution and $\sigma_{\bar{X}}$ the standard deviation of the sampling distribution. Although the mean and standard deviation of the sampling distribution are often unknown, if we assume the null hypothesis that the sample was taken from the population, then $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, where $\mu$ and $\sigma$ give the mean and standard deviation of the population respectively. This allows the Z-statistic to take on the following form:

\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \tag{7} \]


According to the CLT, when the sample size is large ($n \to \infty$), the sample mean $\bar{X}$ will appear normally distributed for arbitrary population distributions. However, when the population distribution appears normal, even when $\sigma$ is unknown it is still possible, when $n > 20$ or 30 (Carlberg, 2011), to use the Z-statistic by assigning $\sigma_{\bar{X}} = s_{\bar{X}}$, the standard deviation of the sample means, for:

\[ Z \approx \frac{\bar{X} - \mu}{s_{\bar{X}}} \quad \text{for large } n. \tag{8} \]

Many natural populations are approximately normally distributed (Salkind, 2008).


The following example by Conley and Pollard* is used to illustrate hypothesis testing using Z-statistics:

Suppose we are interested in the following research hypothesis:

$H_1$: The IQs of Yale students are higher than the general population.

After taking a random sample of 35 students and measuring their IQs, we find that their mean score is 107. Population IQ scores have a mean of 100 and a standard deviation of 15.


The first step involves assuming that the sample was taken from the general population; that is, evaluating the probability that a sample of $n = 35$ students with a mean IQ score of 107 comes from the general population. Leveraging the CLT enables the standard deviation of the sample means for such a sample size to be computed:

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{35}} = 2.535 \]

This can then be used to work out a Z-score to evaluate how many standard deviations the result is from the mean:

\[ Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{107 - 100}{2.535} = 2.761 \]


That is, a sample of 35 students whose average IQ score equals 107 is 2.761 standard deviations away from the expected mean of 100 (from the CLT). Since we are dealing with a one-tailed test, our focus is on the area to the left of the Z-score, which in our case is 0.9971 (99.71%); in other words, the sample mean is higher than 99.71% of all other samples comprising 35 students. The corresponding p-value is $1 - 0.9971 = 0.0029$, which is 29 chances out of 10,000 that this would occur if we assume that there is nothing special about the group.


In statistics, it is customary to use a significance level of $\alpha = 0.05$, corresponding to moderate evidence against the null hypothesis. This level of $\alpha$ implies a willingness to wrongly reject a true null hypothesis once out of 20 times. For the example, for $\alpha = 0.05$, we are confident in rejecting the null hypothesis because the p-value of 0.0029 is less than $\alpha$. In fact, strong evidence is available to support the alternative hypothesis.

* D. Conley and D. Pollard (1998) Hypothesis Testing, Confidence Intervals, and Power, http://www.yale.edu/soc119a/lecture7.htm, accessed April 2013
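
The IQ example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical, and the normal CDF is evaluated through the error function rather than read from a Z-table.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Return the standard error, Z-score and upper-tail p-value."""
    std_error = pop_sd / math.sqrt(n)           # sigma / sqrt(n)
    z = (sample_mean - pop_mean) / std_error    # Equation 7
    p_upper = 1.0 - normal_cdf(z)               # one-tailed (upper) test
    return std_error, z, p_upper


std_error, z, p_value = one_sample_z(107, 100, 15, 35)
print(f"standard error = {std_error:.3f}, Z = {z:.3f}, p = {p_value:.4f}")
print("reject H0 at alpha = 0.05:", p_value < 0.05)
```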


4.3  Standard  t-Test 


When the population statistics (its mean and standard deviation) are unknown, the experiment needs to be conducted on a control group without the treatment applied. The standard t-test is used to compare the means from exactly two groups and is the appropriate statistical test for such a case. It detects whether a statistically significant difference exists between the two groups' results through a p-value expressing the probability of observing a difference at least as large as the one measured if the null hypothesis were true. Assumptions for using the standard t-test include:

1.  the  two  populations  compared  should  both  be  Normally^  distributed 

2.  the  two  populations  should  have  the  same  variance. 


Whereas the Z-statistic takes on a normal distribution for large sample sizes, this is not the case for smaller $n$. That is, for smaller $n$, $\sigma_{\bar{X}}$ is not closely approximated by $s_{\bar{X}}$ and the resulting statistic takes on a t-distribution:

\[ t = \frac{\bar{X} - \mu}{s_{\bar{X}}} \quad \text{for small } n. \tag{9} \]


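The shape of the t-distribution depends on $n$: flatter for smaller values of $n$ and taking the shape of the Normal distribution as the sample size increases (Carlberg, 2011). Its pdf is defined as follows (Papoulis and Pillai, 2002):

\[ f(x) = \frac{\Gamma\left((n+1)/2\right)}{\sqrt{\pi n}\,\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} \quad \text{for } -\infty < x < \infty. \tag{10} \]

In Equation 10, $n$ denotes the degrees of freedom of the distribution. Here, the mean is $\mu = 0$ and the standard deviation is $\sigma = \sqrt{n/(n-2)}$ for $n > 2$.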


After  computing  the  t-statistic,  the  value  is  compared  against  a  threshold  from  a  t-test 
table  giving  the  t-value  required  to  reject  the  null  hypothesis. 
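
As a sanity check on the t-distribution pdf, the sketch below evaluates the density with the standard library and integrates it numerically. The choice of n = 10 degrees of freedom, the integration limits and the helper names are assumptions made purely for illustration.

```python
import math


def t_pdf(x, n):
    """Student t density with n degrees of freedom (Equation 10)."""
    coeff = math.gamma((n + 1) / 2) / (math.sqrt(math.pi * n) * math.gamma(n / 2))
    return coeff * (1 + x * x / n) ** (-(n + 1) / 2)


def trapezoid(func, lo, hi, steps):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h


n = 10
area = trapezoid(lambda x: t_pdf(x, n), -60.0, 60.0, 12000)
variance = trapezoid(lambda x: x * x * t_pdf(x, n), -60.0, 60.0, 12000)

print(f"area = {area:.6f}")                           # should be close to 1
print(f"variance = {variance:.4f} vs n/(n-2) = {n / (n - 2):.4f}")
```

The heavier tails relative to the normal distribution show up as a variance above 1, which shrinks towards 1 as n grows.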


Population  statistics  are  not  always  available,  so  a  separate  control  group  is  used  to 
estimate  the  population  statistics.  The  t-statistic  divides  the  difference  between  group 
means  by  the  variation  within  and  between  the  two  groups  (Salkind,  2008): 


\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left[\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\right]\left[\dfrac{n_1 + n_2}{n_1 n_2}\right]}} \tag{11} \]

Here,

$\bar{X}_1$:  Mean for group 1
$\bar{X}_2$:  Mean for group 2
$n_1$:  Number of participants in group 1
$n_2$:  Number of participants in group 2
$s_1^2$:  Variance for group 1
$s_2^2$:  Variance for group 2


^ Tested using the Kolmogorov-Smirnov or Shapiro-Wilk test.


When the magnitude of the obtained value from Equation 11 is greater than the critical value from a standard t-table, then the null hypothesis can be rejected. Looking up the critical value from a t-table requires knowing the degrees of freedom $df$, the chosen level of significance $\alpha$ and whether a one or two-tailed test is being used. The t-test for independent means defines degrees of freedom as follows (Carlberg, 2011):

\[ df = n_1 - 1 + n_2 - 1 = n_1 + n_2 - 2 \tag{12} \]

This allows the computed t-score to be referenced as $t(df)$.


The  following  example  illustrates  applying  a  t-test  to  data  obtained  from  two  groups 
(Salkind,  2008). 

A  programme  has  been  designed  to  help  Alzheimer's  patients  remember  the  order 
of  daily  tasks.  Group  1  was  taught  using  visuals  while  group  2  was  taught  using 
visuals  and  intense  verbal  rehearsal.  The  data  below  counts  the  number  of  words 
remembered  by  each  group. 


Group  1 

7  5  5  3  4  7 

3  6  1  2  10  9 

3  10  2  8  5  5 

8  1  2  5  1  12 

8  4  15  5  3  4 


Group  2 

5  3  4  4  2 

4  5  2  5  4 

5  4  6  7  6 

8  7  8  8  7 

9  5  7  8  6 


Here, $\bar{X}_1 = 5.43$, $\bar{X}_2 = 5.53$, $s_1 = 3.42$, $s_2 = 2.06$, $n_1 = 30$ and $n_2 = 30$. The null and research hypotheses are as follows:

$H_0$: $\mu_1 = \mu_2$
$H_1$: $\bar{X}_1 \neq \bar{X}_2$

$H_1$ is in the form of a two-tailed non-directional research hypothesis. The level of $\alpha$ (measuring accepted risk / Type 1 error / level of significance) is chosen to be 0.05. Here, the appropriate test is a t-test for independent means rather than one for dependent means because the groups are independent of one another, while the degrees of freedom are $df = 30 + 30 - 2 = 58$:

\[ t(58) = \frac{5.43 - 5.53}{\sqrt{\left[\dfrac{(30-1)3.42^2 + (30-1)2.06^2}{30 + 30 - 2}\right]\left[\dfrac{30 + 30}{30 \times 30}\right]}} = \frac{-0.1}{\sqrt{\left(\dfrac{339.20 + 123.06}{58}\right)\left(\dfrac{60}{900}\right)}} = -0.14 \]

For this problem, there is a requirement to identify the critical value where $df = 58$ and $\alpha = 0.05$ for a two-tailed t-test. Unfortunately, the standard tables for t-values do not give values for $df = 58$, only $df = 55$ or 60. Salkind (2008) suggests choosing the degrees of freedom closest to that desired, in this case $df = 60$. For this example, the magnitude of the obtained value needs to equal or exceed $t_{crit} = 2.001$ to reject the null hypothesis. The magnitude of the obtained value, 0.14, is less than 2.001 so the null hypothesis cannot be rejected. That is, the null hypothesis is the most attractive explanation.
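
The pooled-variance calculation can be sketched from the summary statistics alone. The function name below is hypothetical, and the critical value of 2.001 is simply the table value quoted in the text for df = 60, not something the code derives.

```python
import math


def pooled_t(mean1, mean2, sd1, sd2, n1, n2):
    """t-statistic for independent means (Equation 11) and its df (Equation 12)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    scale = (n1 + n2) / (n1 * n2)
    t = (mean1 - mean2) / math.sqrt(pooled_var * scale)
    return t, n1 + n2 - 2


t_value, df = pooled_t(5.43, 5.53, 3.42, 2.06, 30, 30)

T_CRIT = 2.001  # two-tailed, alpha = 0.05, df = 60 (table value quoted in the text)
reject = abs(t_value) >= T_CRIT

print(f"t({df}) = {t_value:.2f}, reject H0: {reject}")
```

Note that the comparison uses the absolute value of t, as is appropriate for a two-tailed test.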


4.4  Paired  t-Test 

Also referred to as the repeated measures t-test, the paired t-test is used when data has been collected from a single group of participants who are tested twice: before and after a treatment. While similar to the standard t-test, the availability of pre- and post-intervention data for each subject allows the two sets of results to be aggregated by subtracting one from the other. Here, the null hypothesis is that both sets of results come from the same population: $\mu_{post} = \mu_{pre}$, while assuming that $\sigma_{post} = \sigma_{pre}$. As with the standard t-test, assumptions for the paired t-test include:

1.  the two populations compared should both be Normally* distributed

2.  the two populations should have the same variance.

The test statistic for the paired t-test is (Salkind, 2008):

\[ t = \frac{\sum D}{\sqrt{\dfrac{n \sum D^2 - \left(\sum D\right)^2}{n - 1}}} \tag{13} \]

where:

$\sum D$:  Sum of all differences between groups of scores
$\sum D^2$:  Sum of all differences squared between groups of scores
$n$:  Number of pairs of observations

As with the standard t-test, the paired t-test is also referred to by degrees of freedom, although $n$ in the following refers to the pairs of observations (Carlberg, 2011):

\[ df = n - 1 \tag{14} \]

allowing the computed t-score to be referenced as $t(df)$. When the obtained value from Equation 13 is greater than the critical value obtained from a standard t-table, the null hypothesis can be rejected. The critical value is obtained from the cut-off for a t-statistic corresponding to a $df$ value for some chosen level of significance $\alpha$ using a one or two-tailed test.


* Tested using the Kolmogorov-Smirnov or Shapiro-Wilk test.

The  following  example  from  Salkind  (2008)  demonstrates  hypothesis  testing  using  a 
paired  t-test: 

Twenty-five  participants  were  each  tested  before  and  after  a  treatment  with  the 
scores  given  in  the  table  below. 


Participant   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
Pre           3   5   4   6   5   5   4   5   3   6   7   8   7   6   7   8   8   9   9   8   7   7   6   7   8
Post          7   8   6   7   8   9   6   6   7   8   8   7   9  10   9   9   8   8   4   4   5   6   9   8  12

The  null  and  research  hypotheses  are  as  follows: 

$H_0$: $\mu_{post} = \mu_{pre}$
$H_1$: $\bar{X}_{post} > \bar{X}_{pre}$

as a one-tailed, directional research hypothesis. For this experiment, the level of significance or risk is chosen to be $\alpha = 0.05$ for a paired t-test. Testing the null hypothesis here is equivalent to testing whether $\mu_{post} - \mu_{pre} = 0$.

The  difference  and  squared  difference  values  between  the  two  sets  of  scores  are  given 
below: 


Participant   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
Pre           3   5   4   6   5   5   4   5   3   6   7   8   7   6   7   8   8   9   9   8   7   7   6   7   8
Post          7   8   6   7   8   9   6   6   7   8   8   7   9  10   9   9   8   8   4   4   5   6   9   8  12
Diff          4   3   2   1   3   4   2   1   4   2   1  -1   2   4   2   1   0  -1  -5  -4  -2  -1   3   1   4
Diff^2       16   9   4   1   9  16   4   1  16   4   1   1   4  16   4   1   0   1  25  16   4   1   9   1  16

Here, $\sum D = 30$, $\sum D^2 = 180$ and $n = 25$. Applying the t-test given in Equation 13 gives the following result:

\[ t(24) = \frac{30}{\sqrt{\dfrac{(25 \times 180) - 30^2}{25 - 1}}} = \frac{30}{\sqrt{150}} = 2.45 \]

The critical value for the t-test for a one-tailed test, $df = 25 - 1 = 24$ and $\alpha = 0.05$ is 1.711 using a standard t-table. Here, the null hypothesis can be rejected since the obtained value $t(24) = 2.45$ is greater than the critical value 1.711. That is, the result is extreme enough, according to our level of accepted risk, to conclude that the difference between the pre- and post-treatment results did not occur by chance but was due to the treatment.
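
The paired calculation can be reproduced directly from the pre and post scores. This is a minimal sketch; the function name is hypothetical and the critical value of 1.711 is the table value quoted in the text rather than something computed here.

```python
import math

pre = [3, 5, 4, 6, 5, 5, 4, 5, 3, 6, 7, 8, 7, 6, 7, 8, 8, 9, 9, 8, 7, 7, 6, 7, 8]
post = [7, 8, 6, 7, 8, 9, 6, 6, 7, 8, 8, 7, 9, 10, 9, 9, 8, 8, 4, 4, 5, 6, 9, 8, 12]


def paired_t(before, after):
    """Paired t-statistic (Equation 13) from two equally long score lists."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    sum_d = sum(diffs)                    # sum of differences
    sum_d2 = sum(d * d for d in diffs)    # sum of squared differences
    t = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
    return sum_d, sum_d2, t, n - 1


sum_d, sum_d2, t_value, df = paired_t(pre, post)

T_CRIT = 1.711  # one-tailed, alpha = 0.05, df = 24 (table value quoted in the text)
print(f"sum D = {sum_d}, sum D^2 = {sum_d2}, t({df}) = {t_value:.2f}")
print("reject H0:", t_value > T_CRIT)
```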


4.5  ANOVA 

Previously, we examined the two-sample t-test used to compare two population means $\mu_1$ and $\mu_2$. Analysis of Variance (ANOVA) is a hypothesis testing technique for testing the equality of two or more population means. Often known as one-way ANOVA when comparing the means of more than two groups or levels of an independent variable, the analysis involves comparing the variance due to differences between individuals within groups with the variance due to differences between groups (Salkind, 2008). The same assumptions for the t-test apply to ANOVA and include (Smith, 1993):

1.  normally  distributed  data 

2.  independent  samples  taken  from  each  of  the  treatments 

3.  the  population  standard  deviation  a  is  common  to  each  treatment. 

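ANOVA generalises the standard t-test to enable comparisons of $k > 2$ population means. As a result, the null hypothesis becomes (Smith, 1993),

$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$

against an alternative hypothesis that at least two of the population means differ. Central to a $k > 2$ comparison of means is the following total sum of squares ($SS_{total}$):

\[ SS_{total} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_{..}\right)^2 \tag{15} \]

Here, $\bar{X}_{..}$ denotes the grand mean of all samples, $i$ references a particular observation given by $X_{ij}$ in treatment $j$ while $n_j$ gives the total number of observations in treatment $j$. Adding and subtracting the treatment mean $\bar{X}_j$ does not affect the final sum but allows $SS_{total}$ to expand to*

\[
\begin{aligned}
SS_{total} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left[\left(X_{ij} - \bar{X}_j\right) + \left(\bar{X}_j - \bar{X}_{..}\right)\right]^2 \\
&= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left[\left(X_{ij} - \bar{X}_j\right)^2 + 2\left(X_{ij} - \bar{X}_j\right)\left(\bar{X}_j - \bar{X}_{..}\right) + \left(\bar{X}_j - \bar{X}_{..}\right)^2\right] \\
&= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2 + 2 \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)\left(\bar{X}_j - \bar{X}_{..}\right) + \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(\bar{X}_j - \bar{X}_{..}\right)^2
\end{aligned}
\tag{16}
\]

The middle term here can be crossed off because deviations from the mean always sum to zero. Since the third term makes no reference to $X_{ij}$, $SS_{total}$ can be written as follows (Smith, 1993),

\[ SS_{total} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2 + \sum_{j=1}^{k} n_j \left(\bar{X}_j - \bar{X}_{..}\right)^2 = SS_{residuals} + SS_{treatments} \tag{17} \]


* K. McIntyre (2005) Notes on ANOVA, http://ww2.mcdaniel.edu/Bus_Econ/mcintyre/ANOVA.PDF, accessed April 2013
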


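If $s_j$ gives the sample standard deviation of the $j$th sample, then $SS_{residuals}$ can be expressed as,

\[ SS_{residuals} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2 = \sum_{j=1}^{k} (n_j - 1)\, s_j^2 \tag{18} \]

while

\[ SS_{treatments} = \sum_{j=1}^{k} n_j \left(\bar{X}_j - \bar{X}_{..}\right)^2 \tag{19} \]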


$SS_{treatments}$ measures the differences between groups while $SS_{residuals}$ captures the differences between individuals within groups. When the null hypothesis holds, the following provide two unbiased mean square estimators of $\sigma^2$:

\[ MS_{treatments} = \frac{SS_{treatments}}{df_{treatments}} \tag{20} \]

where,

\[ df_{treatments} = k - 1 \tag{21} \]

and

\[ MS_{residuals} = \frac{SS_{residuals}}{df_{residuals}} \tag{22} \]

where,

\[ df_{residuals} = n - k \tag{23} \]

where $n$ gives the total number of measures,

\[ n = \sum_{j=1}^{k} n_j \tag{24} \]

The total degrees of freedom,

\[ df_{total} = n - 1 \tag{25} \]


An F-test is used to check whether the variance between treatments is equal to the variance between individuals within treatments. Using ANOVA,

\[ F = \frac{MS_{treatments}}{MS_{residuals}} \tag{26} \]

The F-value indicates how far the data departs from the null hypothesis, measured as a deviation from $F = 1$. A large value of $F$ implies that the effect of the treatment is relevant because the variability between samples is much larger than the variability within each sample.

The critical value for the F-test can be found from an F-table and requires knowing the Type 1 error rate $\alpha$, $df_{treatments}$ and $df_{residuals}$. The null hypothesis is rejected when the obtained F-value is larger than the critical value $F_{crit}$. The computed values are then summarised in an ANOVA table:


Table 6: Composition of a one-way ANOVA table

Source                  Degrees of Freedom   Sum of Squares   Mean Squares    F                              F_crit
Treatments (between)    df_treatments        SS_treatments    MS_treatments   MS_treatments / MS_residuals   F_crit
Residuals (within)      df_residuals         SS_residuals     MS_residuals


The  following  example  from  Smith  (1993)  illustrates  applying  the  one-way  ANOVA  to 
support  hypothesis  testing: 

Samples  of  25  kg  packs  of  PVC  powder  were  selected  from  5  batches  coming  off  an 
assembly  line.  The  discrepancy  between  the  label  weight  and  its  weighed  value 
was  noted  in  units  of  x  10“^  kg.  It  is  assumed  that  the  5  samples  were 
independent,  normally  distributed  with  a  common  variance.  The  recorded  data  is 
as  follows: 

Table 7: Measurements from batch example.

                      Batch (j ∈ [1,k]) (Treatments)
Sample (i)           1          2          3          4          5
 1                   0          6          7        -25          1
 2                 -15        -28        -13        -45         10
 3                   8          5        -13         17        -45
 4                  36          5        -40          1        -15
 5                   7        -19         15        -14         -3
 6                  27        -10        -15          1        -28
 7                  10         25         17         -5         30
 8                  15        -14        -38        -50         11
 9                  15         10         -5         10         -5
10                  24          0         -2        -35        -50
11                  29         15        -37         10          9
M_j              14.18      -0.45     -11.27     -12.27      -7.73
GM               -3.51      -3.51      -3.51      -3.51      -3.51
M_j - GM         17.69       3.05      -7.76      -8.76      -4.22
SD_j             14.55      15.73      20.42      23.30      24.73

The terms referred to as $M_j$ are the means calculated over a single sample of $n_j = 11$ observations,

\[ M_j = \bar{X}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} X_{ij} \]




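such that for example,

\[ \bar{X}_1 = \frac{0 - 15 + 8 + 36 + 7 + 27 + 10 + 15 + 15 + 24 + 29}{11} = 14.18. \]

The grand mean (GM) is given by,

\[ \bar{X}_{..} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}}{k \times n_j} = \frac{0 - 15 + 8 + \ldots - 5 - 50 + 9}{5 \times 11} = \frac{-193}{55} = -3.51 \]

The standard deviations referred to in the table as $SD_j$ are calculated over a single sample,

\[ s_j = \sqrt{\frac{\sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2}{n_j - 1}} \]

such that for example,

\[ s_1 = \sqrt{\frac{(0 - 14.18)^2 + (-15 - 14.18)^2 + \ldots + (24 - 14.18)^2 + (29 - 14.18)^2}{11 - 1}} = \sqrt{\frac{2117.636}{10}} = 14.55 \]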


The ANOVA table for this example is,

Table 8: ANOVA table for batch example.

Source                  Degrees of Freedom   Sum of Squares   Mean Squares   F      F_crit
Treatments (between)    4                    5248.84          1312.21        3.23   2.56
Residuals (within)      50                   20306.91         406.14

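$SS_{treatments}$ and $SS_{residuals}$ are computed as follows, with the totals evaluated from unrounded means and standard deviations,

\[ SS_{treatments} = \sum_{j=1}^{k} n_j \left(\bar{X}_j - \bar{X}_{..}\right)^2 = 11 \times 17.69^2 + 11 \times 3.05^2 + 11 \times (-7.76)^2 + 11 \times (-8.76)^2 + 11 \times (-4.22)^2 = 5{,}248.84 \]

and

\[ SS_{residuals} = \sum_{j=1}^{k} (n_j - 1)\, s_j^2 = 10 \times 14.55^2 + 10 \times 15.73^2 + 10 \times 20.42^2 + 10 \times 23.30^2 + 10 \times 24.73^2 = 20{,}306.91 \]

$SS_{total}$ is therefore,

\[ SS_{total} = SS_{treatments} + SS_{residuals} = 5{,}248.84 + 20{,}306.91 = 25{,}555.75 \]

The degrees of freedom are calculated as follows,

\[ df_{treatments} = k - 1 = 5 - 1 = 4, \qquad df_{residuals} = n - k = 55 - 5 = 50, \qquad df_{total} = n - 1 = 55 - 1 = 54 \]

with mean squares,

\[ MS_{treatments} = \frac{SS_{treatments}}{df_{treatments}} = \frac{5{,}248.84}{4} = 1{,}312.21, \qquad MS_{residuals} = \frac{SS_{residuals}}{df_{residuals}} = \frac{20{,}306.91}{50} = 406.14 \]

The obtained value for the F-value is,

\[ F = \frac{MS_{treatments}}{MS_{residuals}} = \frac{1{,}312.21}{406.14} = 3.23 \]

$F_{crit}$ for a Type 1 error of $\alpha = 0.05$, $df_{treatments} = 4$ and $df_{residuals} = 50$ is 2.56.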


Since the obtained value of 3.23 is larger than the critical value of 2.56, the evidence supports rejecting the null hypothesis: at least two of the batches are significantly different. The box plot in Figure 6 shows the quartiles of each sample with the whiskers representing the smallest and largest values observed (Coakes and Ong, 2011). The diagram suggests that batch 1 is the likely cause of the rejection of $H_0$.

Figure 6: Box and whisker plot of the batch data produced with SPSS.

A closer look at Figure 6 shows that batch 2 is highly skewed and requires a further assessment of normality. With $n_2 = 11$ (less than 100), the Kolmogorov-Smirnov and Shapiro-Wilk statistics are used to check this requirement (Coakes and Ong, 2011). Employing SPSS produces significance values of 0.2 and 0.896 respectively, both exceeding 0.05, so the assumption of normality is not rejected.
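
The batch example can be recomputed from the raw measurements in Table 7. The sketch below is illustrative only: the function name is hypothetical, and the critical value of 2.56 is the table value quoted in the text, since the standard library has no F-distribution.

```python
batches = [
    [0, -15, 8, 36, 7, 27, 10, 15, 15, 24, 29],
    [6, -28, 5, 5, -19, -10, 25, -14, 10, 0, 15],
    [7, -13, -13, -40, 15, -15, 17, -38, -5, -2, -37],
    [-25, -45, 17, 1, -14, 1, -5, -50, 10, -35, 10],
    [1, 10, -45, -15, -3, -28, 30, 11, -5, -50, 9],
]


def one_way_anova(groups):
    """Return SS_treatments, SS_residuals, their df and the F-value."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group variation (Equation 19).
    ss_treat = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation (Equation 18).
    ss_resid = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_treat, df_resid = k - 1, n - k
    f_value = (ss_treat / df_treat) / (ss_resid / df_resid)
    return ss_treat, ss_resid, df_treat, df_resid, f_value


ss_treat, ss_resid, df_treat, df_resid, f_value = one_way_anova(batches)

F_CRIT = 2.56  # alpha = 0.05, df = (4, 50), table value quoted in the text
print(f"SS_treatments = {ss_treat:.2f}, SS_residuals = {ss_resid:.2f}")
print(f"F({df_treat}, {df_resid}) = {f_value:.2f}, reject H0: {f_value > F_CRIT}")
```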


4.6  Two-Way  ANOVA 

Section  4.5  detailed  the  one-way  ANOVA  for  observations  of  a  single  dependent  variable 
influenced  by  a  single  independent  variable  (the  treatment).  Cases  involving  multiple 
independent  variables  affecting  a  single  dependent  variable  involve  factorial  analysis  of 
variance  (factorial  ANOVA).  The  two-way  ANOVA  is  the  simplest  kind  of  factorial 
ANOVA and is used when there are two independent variables (Salkind, 2008). The two independent variables typically comprise a factor and a treatment producing an outcome measured by the dependent variable. A two-way ANOVA first applies a one-way ANOVA to each of the independent variables, followed by an analysis of whether both factors together affect the outcome. The total sum of squares for a two-way ANOVA with independent variables A and B is (Moore and McCabe, 2003, Sparks, 2011),

\[ SS_{total} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left(X_{ijk} - \bar{X}_{...}\right)^2 = SS_A + SS_B + SS_{AB} + SS_{residuals}. \tag{27} \]

Here, we have avoided labelling A and B as treatments because often just one variable is a treatment while the other represents a factor. Consider factors A and B as having $a$ and $b$ levels respectively, with $n_{ij}$ observations for $i \in [1, a]$ and $j \in [1, b]$. The total number of observations is therefore,

\[ n = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} \tag{28} \]


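$SS_A$, $SS_B$, $SS_{AB}$ and $SS_{residuals}$ in Table 9 are computed as follows, where $SS_A$ and $SS_B$ are obtained by ignoring the unfocussed variable and performing a one-way ANOVA on the data for A and B respectively:

\[ SS_A = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left(\bar{X}_{i..} - \bar{X}_{...}\right)^2 \tag{29} \]

\[ SS_B = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left(\bar{X}_{.j.} - \bar{X}_{...}\right)^2 \tag{30} \]

\[ SS_{AB} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left(\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X}_{...}\right)^2 \tag{31} \]

and

\[ SS_{residuals} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left(X_{ijk} - \bar{X}_{ij.}\right)^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (n_{ij} - 1)\, s_{ij}^2 \tag{32} \]

where $s_{ij}$ denotes the standard deviation within cell $i,j$.

Table 9: Composition of a two-way ANOVA table

Source                 Degrees of Freedom       Sum of Squares   Mean Squares                                 F                      F_crit
Factor A (between)     df_A = a - 1             SS_A             MS_A = SS_A / df_A                           MS_A / MS_residuals    F_crit,A
Factor B (between)     df_B = b - 1             SS_B             MS_B = SS_B / df_B                           MS_B / MS_residuals    F_crit,B
Factor AB (between)    df_AB = (a - 1)(b - 1)   SS_AB            MS_AB = SS_AB / df_AB                        MS_AB / MS_residuals   F_crit,AB
Residuals (within)     df_residuals = n - ab    SS_residuals     MS_residuals = SS_residuals / df_residuals
Total                  df_total = n - 1         SS_total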


To  simplify  the  calculations,  the  statistics  package  SPSS  has  been  used  to  work  through  the 
following  two-way  ANOVA  example  from  Salkind  (2008). 


A study examines the impact of gender and exercise program (two independent variables) on weight loss (the dependent variable). The experimental design is as follows:

              Exercise Program
Gender        High impact    Low impact
Male
Female

The data collected from the experiment is as follows:

Treatment:    High Impact           Low Impact
Gender:       Male      Female      Male      Female
              76        65          88        65
              78        90          76        67
              76        65          76        67
              76        90          76        87
              76        65          56        78
              74        90          76        56
              74        90          76        54
              76        79          98        56
              76        70          88        54
              55        90          78        56

A two-way ANOVA tests three null hypotheses. The first assesses the impact of the exercise program on the outcome of weight loss,

$H_0$: $\mu_{high} = \mu_{low}$
$H_1$: $\bar{X}_{high} \neq \bar{X}_{low}$

The second assesses whether gender has an impact on weight loss,

$H_0$: $\mu_{male} = \mu_{female}$
$H_1$: $\bar{X}_{male} \neq \bar{X}_{female}$

The third hypothesis assesses whether there is an interaction effect of exercise program and gender on weight loss,

$H_0$: $\mu_{high,male} = \mu_{high,female} = \mu_{low,male} = \mu_{low,female}$
$H_1$: $\bar{X}_{high,male} \neq \bar{X}_{high,female} \neq \bar{X}_{low,male} \neq \bar{X}_{low,female}$

The application of SPSS on the data produces the output in Table 10.

Table 10: Applying a two-way ANOVA in SPSS on the data.

Tests of Between-Subjects Effects
Dependent Variable: Loss

Source                           Type III Sum of Squares   df   Mean Square     F          Sig.
Intercept           Hypothesis   218892.025                1    218892.025      1057.322   .020
                    Error        207.025                   1    207.025 (a)
Treatment           Hypothesis   265.225                   1    265.225         .252       .704
                    Error        1050.625                  1    1050.625 (b)
Gender              Hypothesis   207.025                   1    207.025         .197       .734
                    Error        1050.625                  1    1050.625 (b)
Treatment * Gender  Hypothesis   1050.625                  1    1050.625        9.683      .004
                    Error        3906.100                  36   108.503 (c)

a. MS(Gender)
b. MS(Treatment * Gender)
c. MS(Error)


The first hypothesis is tested as though the data excludes gender information and a one-way ANOVA is performed on the treatment of exercise program. From Table 10, SPSS returns a p-value of 0.704 (under the Sig. column) which is larger than $\alpha = 0.05$, meaning we fail to reject the null hypothesis. That is, the choice of exercise program (either high or low impact) by itself has no statistically significant effect on weight loss.

The second hypothesis is tested on the data while ignoring treatment information by performing a one-way ANOVA on gender. The analysis results in a p-value of 0.734 for gender, which is larger than $\alpha = 0.05$, meaning we fail to reject the null hypothesis. We conclude that gender (being either male or female) in isolation has no statistically significant effect on weight loss.

The third hypothesis is tested by considering whether the independent variables of gender and exercise program jointly have an effect on weight loss. The p-value of 0.004 under the final column of the row indicated by Treatment * Gender is less than $\alpha = 0.05$, thus indicating that the two factors together affect the outcome. In other words, an interaction exists between the explanatory variables (Seltman, 2013) and we conclude that the effect of changes in the exercise program depends on the gender (the other factor).
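
The sums of squares reported by SPSS can be cross-checked from the raw weight-loss data. The sketch below assumes the balanced-design definitions of the main-effect, interaction and residual sums of squares; the function and variable names are hypothetical, and no p-values are computed because the standard library lacks an F-distribution.

```python
def mean(values):
    return sum(values) / len(values)


# Weight-loss scores keyed by (exercise program, gender).
cells = {
    ("high", "male"): [76, 78, 76, 76, 76, 74, 74, 76, 76, 55],
    ("high", "female"): [65, 90, 65, 90, 65, 90, 90, 79, 70, 90],
    ("low", "male"): [88, 76, 76, 76, 56, 76, 76, 98, 88, 78],
    ("low", "female"): [65, 67, 67, 87, 78, 56, 54, 56, 54, 56],
}


def two_way_anova(cells):
    """Sums of squares for factor A, factor B, interaction and residuals."""
    grand = mean([x for obs in cells.values() for x in obs])
    mean_a, mean_b = {}, {}
    for a, b in cells:
        # Marginal means pool every observation sharing a factor level.
        mean_a[a] = mean([x for (ai, _), obs in cells.items() if ai == a for x in obs])
        mean_b[b] = mean([x for (_, bj), obs in cells.items() if bj == b for x in obs])
    ss_a = sum(len(obs) * (mean_a[a] - grand) ** 2 for (a, _), obs in cells.items())
    ss_b = sum(len(obs) * (mean_b[b] - grand) ** 2 for (_, b), obs in cells.items())
    ss_ab = sum(len(obs) * (mean(obs) - mean_a[a] - mean_b[b] + grand) ** 2
                for (a, b), obs in cells.items())
    ss_res = sum((x - mean(obs)) ** 2 for obs in cells.values() for x in obs)
    n = sum(len(obs) for obs in cells.values())
    df_a, df_b = len(mean_a) - 1, len(mean_b) - 1
    return {"SS_A": ss_a, "SS_B": ss_b, "SS_AB": ss_ab, "SS_res": ss_res,
            "df_A": df_a, "df_B": df_b, "df_AB": df_a * df_b,
            "df_res": n - len(cells)}


r = two_way_anova(cells)
ms_res = r["SS_res"] / r["df_res"]
ms_ab = r["SS_AB"] / r["df_AB"]

f_interaction = ms_ab / ms_res
# Main effects against the residual mean square, as laid out in Table 9.
f_treatment_residual = (r["SS_A"] / r["df_A"]) / ms_res
f_gender_residual = (r["SS_B"] / r["df_B"]) / ms_res
# Main effects against the interaction mean square, as in Table 10's footnotes.
f_treatment_interaction = (r["SS_A"] / r["df_A"]) / ms_ab
f_gender_interaction = (r["SS_B"] / r["df_B"]) / ms_ab

print({key: round(value, 3) for key, value in r.items()})
print(round(f_interaction, 3), round(f_treatment_residual, 3),
      round(f_treatment_interaction, 3))
```

The sums of squares match Table 10. Dividing the main-effect mean squares by the residual mean square, as Table 9 prescribes, gives F-values of roughly 2.44 (treatment) and 1.91 (gender), whereas Table 10 lists 0.252 and 0.197 because its footnotes name MS(Treatment * Gender) as the error term for those rows. The interaction F-value of 9.683 is the same either way.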


4.7  MANOVA 

Multivariate  analysis  of  variance  (MANOVA)  extends  ANOVA  by  enabling  the  analysis  of 
multiple  dependent  variables  (DVs).  Remember  that  ANOVA  tests  for  the  truth  of  the  null 
hypothesis  that  the  means  of  k  treatments  are  equal: 

$$H_0 : \mu_1 = \mu_2 = \cdots = \mu_k \tag{33}$$
MANOVA tests for the truth of the null hypothesis that the p-dimensional vectors of
population means of the k treatments are equal (Bray and Maxwell, 1985):

$$\begin{aligned}
\mu_{11} &= \mu_{12} = \cdots = \mu_{1k} \\
\mu_{21} &= \mu_{22} = \cdots = \mu_{2k} \\
&\;\;\vdots \\
\mu_{p1} &= \mu_{p2} = \cdots = \mu_{pk}
\end{aligned} \tag{34}$$

allowing relationships between DVs to be assessed. Whereas ANOVA only uses the F-test
to form a test statistic, MANOVA uses several including:

•  Wilks'  lambda 

•  Pillai-Bartlett  trace 

•  Roy's  greatest  characteristic  root 

•  Hotelling-Lawley  trace. 


These statistics are derived from the full and reduced models of errors associated with the
data $X_{ij}$ for trial $i$ in treatment $j$. For the full model, the individual error estimate is given
by,

$$e_{ij}(F) = X_{ij} - \bar{X}_j \tag{35}$$

where $\bar{X}_j$ is the mean of treatment $j$, and is applied to each DV. The sum of squared errors given by,

$$SSE(F) = \sum_j \sum_i e_{ij}^2(F) \tag{36}$$

is also referred to as the sum of squares within treatments, $SS_{residuals}$. Similarly for the
reduced model, the individual error estimate given by,

$$e_{ij}(R) = X_{ij} - \bar{X} \tag{37}$$

where $\bar{X}$ is the grand mean, is also applied to each DV. The sum of squared errors given by,

$$SSE(R) = \sum_j \sum_i e_{ij}^2(R) \tag{38}$$

is also called the total sum of squares, $SS_{total}$. The sum of squares between treatments is,

$$SS_{treatments} = SS_{total} - SS_{residuals} = \sum_j n_j \left(\bar{X}_j - \bar{X}\right)^2 \tag{39}$$

where $n_j$ represents the number of trials in treatment $j$.
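The relationship between the full-model and reduced-model errors can be illustrated numerically for a single DV. The sketch below uses made-up scores for three hypothetical treatments (the data and variable names are assumptions, not taken from the report) and checks that the total sum of squares minus the residual sum of squares equals the between-treatments sum of squares.

```python
# Hypothetical scores for k = 3 treatments on a single DV (illustrative only).
data = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0, 8.0],
    "C": [5.0, 6.0, 7.0],
}

all_scores = [x for scores in data.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

ss_residuals = 0.0  # SSE(F): squared errors about each treatment mean
ss_total = 0.0      # SSE(R): squared errors about the grand mean
ss_between = 0.0    # sum over treatments of n_j * (mean_j - grand_mean)^2

for scores in data.values():
    mean_j = sum(scores) / len(scores)
    ss_residuals += sum((x - mean_j) ** 2 for x in scores)
    ss_total += sum((x - grand_mean) ** 2 for x in scores)
    ss_between += len(scores) * (mean_j - grand_mean) ** 2

ss_treatments = ss_total - ss_residuals

print(ss_residuals, ss_total, ss_treatments, ss_between)
```

For these made-up scores the residual, total and treatment sums of squares are 6.0, 22.5 and 16.5 respectively, and the last two printed values agree.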
For simplicity, the remainder of this section will focus on MANOVA with two DVs.
Table 11 illustrates computing the errors for the full and reduced models for such a problem.
Errors associated with DV1 are labelled $e_1$ and those related to DV2 are
labelled $e_2$ for both models.


Table  11:  Errors  for  Full  and  Reduced  Models  for  two-way  MANOVA  data 


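                      Full Model                              Reduced Model
Treatment  Trial      e1   e2   (e1)^2   (e2)^2   e1e2        e1   e2   (e1)^2   (e2)^2   e1e2

1          1
           ...
           i
           Sum:

...        1
           ...
           i
           Sum:

j          1
           ...
           i
           Sum:

Grand Sum:

Full model:     $e_1 = X_{ij} - \bar{X}_j$ for DV1,  $e_2 = X_{ij} - \bar{X}_j$ for DV2
Reduced model:  $e_1 = X_{ij} - \bar{X}$ for DV1,  $e_2 = X_{ij} - \bar{X}$ for DV2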

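The F-test is the basis of the univariate test of significance, comparing
$\sum_j \sum_i e_1^2(F)$ against $\sum_j \sum_i e_1^2(R)$ for DV1 and separately for all
other DVs. MANOVA also considers the relationship between DVs by also comparing
$\sum_j \sum_i e_1 e_2(F)$ against $\sum_j \sum_i e_1 e_2(R)$. These sums of cross-products
are closely related to the correlation between the two variables where,

$$r_{within} = \frac{\sum_j \sum_i e_1 e_2(F)}{\sqrt{\sum_j \sum_i e_1^2(F) \, \sum_j \sum_i e_2^2(F)}} \quad \text{(within treatments)} \tag{40}$$

and

$$r_{total} = \frac{\sum_j \sum_i e_1 e_2(R)}{\sqrt{\sum_j \sum_i e_1^2(R) \, \sum_j \sum_i e_2^2(R)}} \quad \text{(total sample)} \tag{41}$$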

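The inclusion of cross-product terms for MANOVA results in the following error matrix
for the full model,

$$E = \begin{bmatrix} \sum_j \sum_i e_1^2(F) & \sum_j \sum_i e_1 e_2(F) \\ \sum_j \sum_i e_1 e_2(F) & \sum_j \sum_i e_2^2(F) \end{bmatrix} \tag{42}$$

and the following error matrix for the reduced model,

$$T = \begin{bmatrix} \sum_j \sum_i e_1^2(R) & \sum_j \sum_i e_1 e_2(R) \\ \sum_j \sum_i e_1 e_2(R) & \sum_j \sum_i e_2^2(R) \end{bmatrix} \tag{43}$$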

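The hypothesis sum of squares and cross-product matrix is given by $H = T - E$. The four
multivariate test statistics for MANOVA referred to above are all based on the eigenvalues $\lambda_i$
of $HE^{-1}$, where $i = 1, \ldots, s$. The test statistic for Wilks' lambda is,

$$U = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i} \tag{44}$$

The formula for the Pillai-Bartlett trace is given by,

$$V = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i} \tag{45}$$

while the test statistic for Roy's greatest characteristic root is,

$$GCR = \frac{\lambda_1}{1 + \lambda_1} \tag{46}$$

where $\lambda_1$ is the largest eigenvalue. The formula for the Hotelling-Lawley trace is,

$$\operatorname{tr}\left(HE^{-1}\right) = \sum_{i=1}^{s} \lambda_i \tag{47}$$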


Determining  whether  the  null  hypothesis  should  be  rejected  requires  comparing  the 
observed  value  of  the  test  statistic  to  the  sampling  distribution  of  the  statistic  under  the 
null  hypothesis.  This  requires  a  transformation  of  each  test  statistic  to  a  variable 
approximating  the  F-distribution  with  the  intention  of  deriving  a  p-value.  As  SPSS 
produces  the  desired  outputs  automatically,  we  won't  be  detailing  the  calculations. 
However, Bray and Maxwell (1985) summarise the transformations from the U and
V-statistics to those approximating an F-variable for those interested in the calculations.
The resulting p-values are compared to the chosen significance level α to decide whether
to reject the null hypothesis.
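To make the chain from the error matrices to the four test statistics concrete, the following is a minimal sketch for the two-DV case. The data are made up, the function and variable names are illustrative assumptions, and the 2 x 2 eigenvalues are obtained from the trace and determinant rather than from a linear algebra library. It does not perform the F-approximation or produce p-values.

```python
import math

# Hypothetical data: treatment -> list of (DV1, DV2) observations.
data = {
    "A": [(4.0, 10.0), (5.0, 12.0), (6.0, 11.0)],
    "B": [(7.0, 14.0), (8.0, 13.0), (9.0, 16.0)],
    "C": [(5.0, 9.0), (6.0, 12.0), (7.0, 10.0)],
}


def mean(rows):
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def sscp(rows, centre):
    """Return (sum e1^2, sum e2^2, sum e1*e2) for errors about `centre`."""
    s11 = sum((x1 - centre[0]) ** 2 for x1, _ in rows)
    s22 = sum((x2 - centre[1]) ** 2 for _, x2 in rows)
    s12 = sum((x1 - centre[0]) * (x2 - centre[1]) for x1, x2 in rows)
    return s11, s22, s12


# Full model: errors about each treatment mean, accumulated into E.
e11 = e22 = e12 = 0.0
for rows in data.values():
    s11, s22, s12 = sscp(rows, mean(rows))
    e11, e22, e12 = e11 + s11, e22 + s22, e12 + s12

# Reduced model: errors about the grand mean give T.
all_rows = [row for rows in data.values() for row in rows]
t11, t22, t12 = sscp(all_rows, mean(all_rows))

# Hypothesis matrix H = T - E.
h11, h22, h12 = t11 - e11, t22 - e22, t12 - e12

# M = H E^{-1}, using the closed-form inverse of the symmetric 2 x 2 matrix E.
det_e = e11 * e22 - e12 ** 2
m11 = (h11 * e22 - h12 * e12) / det_e
m12 = (h12 * e11 - h11 * e12) / det_e
m21 = (h12 * e22 - h22 * e12) / det_e
m22 = (h22 * e11 - h12 * e12) / det_e

# Eigenvalues of a 2 x 2 matrix from its trace and determinant.
trace_m = m11 + m22
det_m = m11 * m22 - m12 * m21
root = math.sqrt(max(trace_m ** 2 - 4.0 * det_m, 0.0))
lam1, lam2 = (trace_m + root) / 2.0, (trace_m - root) / 2.0

wilks_u = 1.0 / ((1.0 + lam1) * (1.0 + lam2))
pillai_v = lam1 / (1.0 + lam1) + lam2 / (1.0 + lam2)
roy_gcr = lam1 / (1.0 + lam1)
hotelling_lawley = lam1 + lam2

print(wilks_u, pillai_v, roy_gcr, hotelling_lawley)
```

A useful sanity check on such a calculation is that Wilks' lambda computed from the eigenvalues equals det(E)/det(T), and that the Hotelling-Lawley trace equals the trace of the matrix product.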


4.8  Mann-Whitney U-Test / Wilcoxon Rank-Sum Test

The Mann-Whitney U-test is considered the non-parametric equivalent of the independent
samples t-test and can be performed on ordinal (ranked) data. It is well suited to Likert
item data, for which it cannot be presumed that the underlying population fits a normal
distribution, and it is included in most modern statistical packages. The analysis of
individual Likert items is in terms of determining whether there is a statistically
significant difference in responses between the two treatments.

Although the Mann-Whitney test does not require normally distributed data, it does
require that the data from each population be an independent random sample, and that
the population distributions have equal variances and the same shape. Equal
variance can be tested through the application of a non-parametric version of Levene's
test for homogeneity of variance. This test can also be used to support the required
assumption that the two samples come from distributions of the same shape. In addition, a
visual inspection of the probability plot may help in determining whether the distributions
look similar.


Testing the Likert item results consists of creating a null and an alternative hypothesis. The
null hypothesis, $H_0$, is: the samples come from the same distribution, or there is no
difference in ranks between the treatments. The alternative hypothesis is: the samples
come from different distributions, or there is a significant difference in ranks between the
treatments. The significance level is typically α = 0.05.

The Mann-Whitney U-statistic is defined as¹⁰:

$$U = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - \sum_i R_i \tag{48}$$

where $n_1$ and $n_2$ are the two sample sizes and $R_i$ are the rank values of the
observations in sample 2.

The U-test can be calculated by hand for small data sets. The calculation procedure is
conducted as described below:

1. Rank the results; where there is a tie, use the average value.

2. Add up the ranks for the observations which came from sample 1 ($R_1$) and use
   the expression below to calculate $U_1$:

   $$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 \tag{49}$$

   for sample size $n_1$ and sum of ranks $R_1$ in sample 1.

3. Use the expression below to calculate $U_2$,

   $$U_2 = n_1 n_2 - U_1 \tag{50}$$

Use the smaller value of $U_1$ and $U_2$ when consulting significance tables. Note: The
Mann-Whitney U-statistic approximately follows a normal (Z) distribution when the sample
size is greater than 20. For sample sizes less than 20 refer to the Mann-Whitney Critical
Value table.
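The hand calculation above can be sketched in a few lines of code. The example below is illustrative only: the two samples are made-up Likert-style responses, the function names are assumptions, and $U_2$ is obtained from the identity $U_1 + U_2 = n_1 n_2$. It stops at the U value and does not look up critical values or compute a p-value.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        avg = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2, min(U1, U2)) following the ranking procedure."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])                  # sum of ranks in sample 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1                     # identity: U1 + U2 = n1 * n2
    return u1, u2, min(u1, u2)


# Hypothetical responses from two treatments.
sample1 = [3, 4, 2, 6, 2, 5]
sample2 = [9, 7, 5, 10, 6, 8]

u1, u2, u = mann_whitney_u(sample1, sample2)
print(u1, u2, u)  # 34.0 2.0 2.0
```

Here the rank sum for sample 1 is 23, so $U_1 = 36 + 21 - 23 = 34$ and $U_2 = 36 - 34 = 2$; the smaller value, 2, is the one taken to the critical value table.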


¹⁰ StatsDirect, Mann Whitney U-Test,
www.statsdirect.com/webhelp/#nonparametric_methods/mann_whitney.htm, accessed April 2014
5.  Summary

A  hypothesis  has  to  be  capable  of  being  tested  by  an  experiment.  Well-designed 
experiments  provide  a  systematic  approach  to  observe  relations  among  variables  while 
ruling  out  alternative  explanations.  This  report  has  examined  a  number  of  experimental 
designs  including  the  simple  experiment,  matched-pairs,  repeated-measures  and  the 
single  group  design.  While  uncontrolled  experiments  such  as  those  conducted  in  the  field 
do  not  mitigate  all  confounding  factors,  they  provide  external-validity  to  support  the 
results  from  controlled  experiments. 

A difficulty with human-in-the-loop experiments is the requirement to gauge behaviours,
characteristics,  attitudes  and  opinions.  An  approach  to  generate  quantitative  data  from 
such  responses  is  to  use  Likert  scales.  Once  collected,  the  data  can  be  analysed  using 
descriptive  statistics  or  compared  using  a  number  of  approaches  to  visualise  results.  These 
approaches  reveal  differences  between  the  specific  groups  but  to  generalise  the  results  to  a 
larger  population  requires  inferential  statistics. 